DROPS

Document

APPROX

DOI: 10.4230/LIPIcs.APPROX/RANDOM.2020.46

Improved Circular k-Mismatch Sketches

Authors: Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, Ely Porat, and Przemysław Uznański

Published in: LIPIcs, Volume 176, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020)

Abstract

The shift distance sh(S₁,S₂) between two strings S₁ and S₂ of the same length is defined as the minimum Hamming distance between S₁ and any rotation (cyclic shift) of S₂. We study the problem of sketching the shift distance, which is the following communication complexity problem: Strings S₁ and S₂ of length n are given to two identical players (encoders), who independently compute sketches (summaries) sk(S₁) and sk(S₂), respectively, so that upon receiving the two sketches, a third player (decoder) is able to compute (or approximate) sh(S₁,S₂) with high probability. This paper primarily focuses on the more general k-mismatch version of the problem, where the decoder is allowed to declare a failure if sh(S₁,S₂) > k, where k is a parameter known to all parties. Andoni et al. (STOC'13) introduced exact circular k-mismatch sketches of size Õ(k+D(n)), where D(n) is the number of divisors of n. Andoni et al. also showed that their sketch size is optimal in the class of linear homomorphic sketches. We circumvent this lower bound by designing a (non-linear) exact circular k-mismatch sketch of size Õ(k); this size matches communication-complexity lower bounds. We also design (1± ε)-approximate circular k-mismatch sketch of size Õ(min(ε^{-2}√k, ε^{-1.5}√n)), which improves upon an Õ(ε^{-2}√n)-size sketch of Crouch and McGregor (APPROX'11).

Cite as

Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, Ely Porat, and Przemysław Uznański. Improved Circular k-Mismatch Sketches. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 176, pp. 46:1-46:24, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{golan_et_al:LIPIcs.APPROX/RANDOM.2020.46,
  author =	{Golan, Shay and Kociumaka, Tomasz and Kopelowitz, Tsvi and Porat, Ely and Uzna\'{n}ski, Przemys{\l}aw},
  title =	{{Improved Circular k-Mismatch Sketches}},
  booktitle =	{Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020)},
  pages =	{46:1--46:24},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-164-1},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{176},
  editor =	{Byrka, Jaros{\l}aw and Meka, Raghu},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX/RANDOM.2020.46},
  URN =		{urn:nbn:de:0030-drops-126492},
  doi =		{10.4230/LIPIcs.APPROX/RANDOM.2020.46},
  annote =	{Keywords: Hamming distance, k-mismatch, sketches, rotation, cyclic shift, communication complexity}
}

Document

DOI: 10.4230/LIPIcs.CPM.2020.5

Time-Space Tradeoffs for Finding a Long Common Substring

Authors: Stav Ben-Nun, Shay Golan, Tomasz Kociumaka, and Matan Kraus

Published in: LIPIcs, Volume 161, 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)

Abstract

We consider the problem of finding, given two documents of total length n, a longest string occurring as a substring of both documents. This problem, known as the Longest Common Substring (LCS) problem, has a classic 𝒪(n)-time solution dating back to the discovery of suffix trees (Weiner, 1973) and their efficient construction for integer alphabets (Farach-Colton, 1997). However, these solutions require Θ(n) space, which is prohibitive in many applications. To address this issue, Starikovskaya and Vildhøj (CPM 2013) showed that for n^{2/3} ≤ s ≤ n, the LCS problem can be solved in 𝒪(s) space and 𝒪̃(n²/s) time. Kociumaka et al. (ESA 2014) generalized this tradeoff to 1 ≤ s ≤ n, thus providing a smooth time-space tradeoff from constant to linear space. In this paper, we obtain a significant speed-up for instances where the length L of the sought LCS is large. For 1 ≤ s ≤ n, we show that the LCS problem can be solved in 𝒪(s) space and 𝒪̃(n²/(L⋅s) +n) time. The result is based on techniques originating from the LCS with Mismatches problem (Flouri et al., 2015; Charalampopoulos et al., CPM 2018), on space-efficient locally consistent parsing (Birenzwige et al., SODA 2020), and on the structure of maximal repetitions (runs) in the input documents.

Cite as

Stav Ben-Nun, Shay Golan, Tomasz Kociumaka, and Matan Kraus. Time-Space Tradeoffs for Finding a Long Common Substring. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 5:1-5:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{bennun_et_al:LIPIcs.CPM.2020.5,
  author =	{Ben-Nun, Stav and Golan, Shay and Kociumaka, Tomasz and Kraus, Matan},
  title =	{{Time-Space Tradeoffs for Finding a Long Common Substring}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{5:1--5:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.5},
  URN =		{urn:nbn:de:0030-drops-121302},
  doi =		{10.4230/LIPIcs.CPM.2020.5},
  annote =	{Keywords: longest common substring, time-space tradeoff, local consistency, periodicity}
}

Document

DOI: 10.4230/LIPIcs.CPM.2020.15

The Streaming k-Mismatch Problem: Tradeoffs Between Space and Total Time

Authors: Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, and Ely Porat

Published in: LIPIcs, Volume 161, 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)

Abstract

We revisit the k-mismatch problem in the streaming model on a pattern of length m and a streaming text of length n, both over a size-σ alphabet. The current state-of-the-art algorithm for the streaming k-mismatch problem, by Clifford et al. [SODA 2019], uses Õ(k) space and Õ(√k) worst-case time per character. The space complexity is known to be (unconditionally) optimal, and the worst-case time per character matches a conditional lower bound. However, there is a gap between the total time cost of the algorithm, which is Õ(n√k), and the fastest known offline algorithm, which costs Õ(n + min(nk/√m, σn)) time. Moreover, it is not known whether improvements over the Õ(n√k) total time are possible when using more than O(k) space. We address these gaps by designing a randomized streaming algorithm for the k-mismatch problem that, given an integer parameter k≤s≤m, uses Õ(s) space and costs Õ(n+min(nk²/m, nk/√s, σnm/s)) total time. For s=m, the total runtime becomes Õ(n + min(nk/√m, σn)), which matches the time cost of the fastest offline algorithm. Moreover, the worst-case time cost per character is still Õ(√k).

Cite as

Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, and Ely Porat. The Streaming k-Mismatch Problem: Tradeoffs Between Space and Total Time. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 15:1-15:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{golan_et_al:LIPIcs.CPM.2020.15,
  author =	{Golan, Shay and Kociumaka, Tomasz and Kopelowitz, Tsvi and Porat, Ely},
  title =	{{The Streaming k-Mismatch Problem: Tradeoffs Between Space and Total Time}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{15:1--15:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.15},
  URN =		{urn:nbn:de:0030-drops-121406},
  doi =		{10.4230/LIPIcs.CPM.2020.15},
  annote =	{Keywords: Streaming pattern matching, Hamming distance, k-mismatch}
}

Document

DOI: 10.4230/LIPIcs.ICALP.2018.65

Towards Optimal Approximate Streaming Pattern Matching by Matching Multiple Patterns in Multiple Streams

Authors: Shay Golan, Tsvi Kopelowitz, and Ely Porat

Published in: LIPIcs, Volume 107, 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018)

Abstract

Recently, there has been a growing focus in solving approximate pattern matching problems in the streaming model. Of particular interest are the pattern matching with k-mismatches (KMM) problem and the pattern matching with w-wildcards (PMWC) problem. Motivated by reductions from these problems in the streaming model to the dictionary matching problem, this paper focuses on designing algorithms for the dictionary matching problem in the multi-stream model where there are several independent streams of data (as opposed to just one in the streaming model), and the memory complexity of an algorithm is expressed using two quantities: (1) a read-only shared memory storage area which is shared among all the streams, and (2) local stream memory that each stream stores separately. In the dictionary matching problem in the multi-stream model the goal is to preprocess a dictionary D={P_1,P_2,...,P_d} of d=|D| patterns (strings with maximum length m over alphabet Sigma) into a data structure stored in shared memory, so that given multiple independent streaming texts (where characters arrive one at a time) the algorithm reports occurrences of patterns from D in each one of the texts as soon as they appear. We design two efficient algorithms for the dictionary matching problem in the multi-stream model. The first algorithm works when all the patterns in D have the same length m and costs O(d log m) words in shared memory, O(log m log d) words in stream memory, and O(log m) time per character. The second algorithm works for general D, but the time cost per character becomes O(log m+log d log log d). We also demonstrate the usefulness of our first algorithm in solving both the KMM problem and PMWC problem in the streaming model. In particular, we obtain the first almost optimal (up to poly-log factors) algorithm for the PMWC problem in the streaming model. We also design a new algorithm for the KMM problem in the streaming model that, up to poly-log factors, has the same bounds as the most recent results that use different techniques. Moreover, for most inputs, our algorithm for KMM is significantly faster on average.

Cite as

Shay Golan, Tsvi Kopelowitz, and Ely Porat. Towards Optimal Approximate Streaming Pattern Matching by Matching Multiple Patterns in Multiple Streams. In 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 107, pp. 65:1-65:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{golan_et_al:LIPIcs.ICALP.2018.65,
  author =	{Golan, Shay and Kopelowitz, Tsvi and Porat, Ely},
  title =	{{Towards Optimal Approximate Streaming Pattern Matching by Matching Multiple Patterns in Multiple Streams}},
  booktitle =	{45th International Colloquium on Automata, Languages, and Programming (ICALP 2018)},
  pages =	{65:1--65:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-076-7},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{107},
  editor =	{Chatzigiannakis, Ioannis and Kaklamanis, Christos and Marx, D\'{a}niel and Sannella, Donald},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2018.65},
  URN =		{urn:nbn:de:0030-drops-90690},
  doi =		{10.4230/LIPIcs.ICALP.2018.65},
  annote =	{Keywords: Streaming approximate pattern matching, Dictionary matching}
}

Document

DOI: 10.4230/LIPIcs.ESA.2017.41

Real-Time Streaming Multi-Pattern Search for Constant Alphabet

Authors: Shay Golan and Ely Porat

Published in: LIPIcs, Volume 87, 25th Annual European Symposium on Algorithms (ESA 2017)

Abstract

In the streaming multi-pattern search problem, which is also known as the streaming dictionary matching problem, a set D={P_1,P_2, . . . ,P_d} of d patterns (strings over an alphabet Sigma), called the dictionary, is given to be preprocessed. Then, a text T arrives one character at a time and the goal is to report, before the next character arrives, the longest pattern in the dictionary that is a current suffix of T. We prove that for a constant size alphabet, there exists a randomized Monte-Carlo algorithm for the streaming dictionary matching problem that takes constant time per character and uses O(d log m) words of space, where m is the length of the longest pattern in the dictionary. In the case where the alphabet size is not constant, we introduce two new randomized Monte-Carlo algorithms with the following complexities: * O(log log |Sigma|) time per character in the worst case and O(d log m) words of space. * O(1/epsilon) time per character in the worst case and O(d |\Sigma|^epsilon log m/epsilon) words of space for any 0<epsilon<= 1. These results improve upon the algorithm of [Clifford et al., ESA'15] which uses O(d log m) words of space and takes O(log log (m+d)) time per character.

Cite as

Shay Golan and Ely Porat. Real-Time Streaming Multi-Pattern Search for Constant Alphabet. In 25th Annual European Symposium on Algorithms (ESA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 87, pp. 41:1-41:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{golan_et_al:LIPIcs.ESA.2017.41,
  author =	{Golan, Shay and Porat, Ely},
  title =	{{Real-Time Streaming Multi-Pattern Search for Constant Alphabet}},
  booktitle =	{25th Annual European Symposium on Algorithms (ESA 2017)},
  pages =	{41:1--41:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-049-1},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{87},
  editor =	{Pruhs, Kirk and Sohler, Christian},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2017.41},
  URN =		{urn:nbn:de:0030-drops-78550},
  doi =		{10.4230/LIPIcs.ESA.2017.41},
  annote =	{Keywords: multi-pattern, dictionary, streaming pattern matching, fingerprints}
}

Document

DOI: 10.4230/LIPIcs.ESA.2016.44

Streaming Pattern Matching with d Wildcards

Authors: Shay Golan, Tsvi Kopelowitz, and Ely Porat

Published in: LIPIcs, Volume 57, 24th Annual European Symposium on Algorithms (ESA 2016)

Abstract

In the pattern matching with d wildcards problem we are given a text T of length n and a pattern P of length m that contains d wildcard characters, each denoted by a special symbol '?'. A wildcard character matches any other character. The goal is to establish for each m-length substring of T whether it matches P. In the streaming model variant of the pattern matching with d wildcards problem the text T arrives one character at a time and the goal is to report, before the next character arrives, if the last m characters match P while using only o(m) words of space. In this paper we introduce two new algorithms for the d wildcard pattern matching problem in the streaming model. The first is a randomized Monte Carlo algorithm that is parameterized by a constant 0<=delta<=1. This algorithm uses ~O(d^{1-delta}) amortized time per character and ~O(d^{1+delta}) words of space. The second algorithm, which is used as a black box in the first algorithm, is a randomized Monte Carlo algorithm which uses O(d+log m) worst-case time per character and O(d log m) words of space.

Cite as

Shay Golan, Tsvi Kopelowitz, and Ely Porat. Streaming Pattern Matching with d Wildcards. In 24th Annual European Symposium on Algorithms (ESA 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 57, pp. 44:1-44:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)

Copy BibTex To Clipboard

@InProceedings{golan_et_al:LIPIcs.ESA.2016.44,
  author =	{Golan, Shay and Kopelowitz, Tsvi and Porat, Ely},
  title =	{{Streaming Pattern Matching with d Wildcards}},
  booktitle =	{24th Annual European Symposium on Algorithms (ESA 2016)},
  pages =	{44:1--44:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-015-6},
  ISSN =	{1868-8969},
  year =	{2016},
  volume =	{57},
  editor =	{Sankowski, Piotr and Zaroliagis, Christos},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2016.44},
  URN =		{urn:nbn:de:0030-drops-63561},
  doi =		{10.4230/LIPIcs.ESA.2016.44},
  annote =	{Keywords: wildcards, don't-cares, streaming pattern matching, fingerprints}
}

Search Results

Documents authored by Golan, Shay

Improved Circular k-Mismatch Sketches

Abstract

Cite as

Time-Space Tradeoffs for Finding a Long Common Substring

Abstract

Cite as

The Streaming k-Mismatch Problem: Tradeoffs Between Space and Total Time

Abstract

Cite as

Towards Optimal Approximate Streaming Pattern Matching by Matching Multiple Patterns in Multiple Streams

Abstract

Cite as

Real-Time Streaming Multi-Pattern Search for Constant Alphabet

Abstract

Cite as

Streaming Pattern Matching with d Wildcards

Abstract

Cite as

Thanks for your feedback!

Could not send message